Natural Covariate-adjusted Graphical Regression

Robin Liu

Department of Statistics and Applied Probability, UC Santa Barbara

Understanding Gene Networks

Signaling pathways associated with Glioblastoma multiforme (GBM)

Gaussian Graphical Models

Encodes the conditional independence structure of Gaussian random variables

Let \(\mathbf{X} \sim \mathcal{N}_p(\mathbf{0}, \Sigma)\) and define the precision matrix \(\Omega = \Sigma^{-1}\).

Key property: \(X_j \perp\!\!\!\perp X_k \mid X_{\setminus\{j, k\}} \iff \Omega_{jk} = 0\)

Estimating the conditional independence of \(p\) Gaussian random variables amounts to estimating the sparsity of the precision matrix \(\Omega\).
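The key property can be illustrated numerically. Below is a minimal Python sketch using a hypothetical sparse 4×4 precision matrix: the zero entry \(\Omega_{14}\) yields a near-zero partial correlation, while a nonzero entry does not (recall the identity \(\rho_{jk\cdot\text{rest}} = -\Omega_{jk}/\sqrt{\Omega_{jj}\Omega_{kk}}\)).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sparse precision matrix: Omega[0, 3] = 0, so X_0 _||_ X_3 | rest.
Omega = np.array([
    [2.0, 0.6, 0.0, 0.0],
    [0.6, 2.0, 0.5, 0.0],
    [0.0, 0.5, 2.0, 0.4],
    [0.0, 0.0, 0.4, 2.0],
])
Sigma = np.linalg.inv(Omega)
X = rng.multivariate_normal(np.zeros(4), Sigma, size=200_000)

# Partial correlation of X_j, X_k given the rest: -Omega_jk / sqrt(Omega_jj * Omega_kk).
Omega_hat = np.linalg.inv(np.cov(X, rowvar=False))
pcorr_03 = -Omega_hat[0, 3] / np.sqrt(Omega_hat[0, 0] * Omega_hat[3, 3])
pcorr_01 = -Omega_hat[0, 1] / np.sqrt(Omega_hat[0, 0] * Omega_hat[1, 1])

print(abs(pcorr_03) < 0.02)  # zero precision entry -> near-zero partial correlation
print(abs(pcorr_01) > 0.1)   # nonzero precision entry -> clearly nonzero
```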

Gaussian Graphical Models

Estimated graph of \(p=73\) gene expressions related to GBM

eQTL Analysis

  • Genetic variants at a DNA locus may mediate the co-expression of genes (Fehrmann et al. 2011).
  • We are concerned with quantitative trait loci (QTL) of a single nucleotide, or single nucleotide polymorphisms (SNPs).

  • Problem statement: given data on gene expressions and additional genetic markers, identify the SNPs that mediate the co-expression of genes. This task is known as expression QTL (eQTL) analysis.

  • We must incorporate external covariate information into the graphical model.

Covariate-adjusted Graphical Models

  • Suppose in addition to responses \(\mathbf{X} \in \mathbb{R}^p\) we observe a vector \(\mathbf{U}\in\mathbb{R}^q\) of covariates.

  • Want to model the conditional distribution \(\mathbf{X} \mid \mathbf{U} = \mathbf{u} \sim \mathcal{N}_p(\mu(\mathbf{u}), \Sigma(\mathbf{u}))\) using a graphical model.

  • Many works allow the mean expression level to depend on covariates, but few allow the graph structure to be adjusted by covariates.

Zhang & Li (2022) propose an additive structure for the precision matrix:

\[\mu(\mathbf{u}) = \Gamma \mathbf{u},\quad \Sigma^{-1}(\mathbf{u}) = \Omega(\mathbf{u}) = B_0 + B_1u_1 + \dotsb + B_qu_q\]
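The additive structure can be sketched directly; here is a minimal Python example with hypothetical coefficient matrices \(B_0, B_1, B_2\), where each covariate (e.g. a SNP genotype) switches a particular edge on or off:

```python
import numpy as np

p = 5  # hypothetical dimensions for illustration

# Hypothetical sparse symmetric coefficient matrices B_0, B_1, B_2.
B0 = np.eye(p) * 3.0
B1 = np.zeros((p, p)); B1[0, 1] = B1[1, 0] = 0.8   # covariate u_1 switches edge (0, 1)
B2 = np.zeros((p, p)); B2[2, 3] = B2[3, 2] = -0.5  # covariate u_2 switches edge (2, 3)

def omega(u):
    """Covariate-adjusted precision matrix Omega(u) = B_0 + B_1 u_1 + B_2 u_2."""
    return B0 + B1 * u[0] + B2 * u[1]

Om = omega(np.array([1.0, 0.0]))       # e.g. a carrier of SNP 1 only
print(Om[0, 1] != 0)                   # edge (0, 1) present for this genotype
print(omega(np.zeros(2))[0, 1] == 0)   # edge absent at baseline
```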

Estimation

\[ \begin{align*} \mathbf{X}\mid\mathbf{U} &=\mathbf{u} \sim \mathcal{N}_p(\mu(\mathbf{u}), \Omega^{-1}(\mathbf{u})) \\ \mu(\mathbf{u}) &= \Gamma\mathbf u \\ \Omega(\mathbf{u}) &= B_0 + B_1u_1 + \dotsb + B_qu_q \end{align*} \]

\(\Omega_{jk}\) is related to the partial correlation of \(X_j\) and \(X_k\). Estimated via nodewise regression (Meinshausen & Bühlmann ’06).

\[ X_j = \mathbf{u}^\top\boldsymbol{\gamma}_j + (\mathbf{X}_{-j} - \mathbf{u}^\top\boldsymbol{\gamma}_j)^\top \boldsymbol{\beta}_{j, 0} + \sum_{h=1}^q \underbrace{u_h \times (\mathbf{X}_{-j} - \mathbf{u}^\top\boldsymbol{\gamma}_j)^\top}_{\text{interaction term}} \boldsymbol{\beta}_{j,h} + \epsilon_j \]

This regression problem is not jointly convex in \(\boldsymbol{\gamma}_j\) and \(\boldsymbol{\beta}_j\).

Convex Formulation

  • Start with \(\mathbf{X} \sim \mathcal{N}_p(\mu, \Omega^{-1})\).

  • Define the natural parameters \(\boldsymbol{\theta} = \operatorname{diag}(\Omega)^{-1}\Omega\boldsymbol{\mu}\) and \(\Theta = -\operatorname{diag}(\Omega)^{-1}\Omega\).

  • It follows that, conditionally on \(\mathbf{X}_{-j}\), \(X_j = \theta_j + \Theta_{j,-j}\mathbf{X}_{-j} + \epsilon_j\) with \(\epsilon_j \sim \mathcal{N}(0, 1/\Omega_{jj})\).
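The natural parameterization can be verified numerically. A Python sketch with a generic positive-definite \(\Omega\) and arbitrary \(\mu\) (all values hypothetical) checks that \(\theta_j + \Theta_{j,-j}\mathbf{x}\) matches the standard Gaussian conditional mean:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
A = rng.standard_normal((p, p))
Omega = A @ A.T + p * np.eye(p)   # a generic positive-definite precision matrix
mu = rng.standard_normal(p)

# Natural parameters: theta = diag(Omega)^{-1} Omega mu, Theta = -diag(Omega)^{-1} Omega.
D_inv = np.diag(1.0 / np.diag(Omega))
theta = D_inv @ Omega @ mu
Theta = -D_inv @ Omega

# Compare with the standard conditional mean for node j:
# E[X_j | X_{-j} = x] = mu_j - Omega_{j,-j} (x - mu_{-j}) / Omega_{jj}
j = 0
idx = np.arange(p) != j
x_rest = rng.standard_normal(p - 1)
lhs = theta[j] + Theta[j, idx] @ x_rest
rhs = mu[j] - Omega[j, idx] @ (x_rest - mu[idx]) / Omega[j, j]
print(np.isclose(lhs, rhs))  # True
```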

Our model is specified as

\[ \begin{align*} \mathbf{X}\mid\mathbf{U} &=\mathbf{u} \sim \mathcal{N}_p(\mu(\mathbf{u}), \Omega^{-1}(\mathbf{u})) \\ \theta(\mathbf{u}) &= \Gamma\mathbf u \\ \Theta(\mathbf{u}) &= B_0 + B_1u_1 + \dotsb + B_qu_q \end{align*} \]

with corresponding nodewise regression problem

\[X_j = \mathbf{u}^\top\boldsymbol{\gamma}_j + \mathbf{X}_{-j}^\top \boldsymbol{\beta}_{j,0} + \sum_{h=1}^q u_h \mathbf{X}_{-j}^\top \boldsymbol{\beta}_{j,h}+ \epsilon_j\]

which is jointly convex in \(\boldsymbol{\gamma}_j\) and \(\boldsymbol{\beta}_j\).
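Assembling the design matrix for this convex regression is straightforward; a minimal sketch (dimensions hypothetical), with columns \(\mathbf{u}\), \(\mathbf{X}_{-j}\), and the \(q\) interaction blocks \(u_h \mathbf{X}_{-j}\):

```python
import numpy as np

def nodewise_design(X, U, j):
    """Design matrix for the convex nodewise regression of node j:
    columns [u, X_{-j}, u_1 * X_{-j}, ..., u_q * X_{-j}]."""
    q = U.shape[1]
    X_rest = np.delete(X, j, axis=1)                        # (n, p-1)
    interactions = [U[:, [h]] * X_rest for h in range(q)]   # each (n, p-1)
    return np.hstack([U, X_rest] + interactions)            # (n, q + (p-1)(q+1))

n, p, q = 50, 6, 3
rng = np.random.default_rng(3)
X, U = rng.standard_normal((n, p)), rng.standard_normal((n, q))
Z = nodewise_design(X, U, j=0)
print(Z.shape)  # (50, 23): q + (p-1)(q+1) = 3 + 5 * 4 columns
```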

Sparse group lasso

For each node \(j = 1, \dotsc, p\) we estimate

\[ \begin{align*} \gamma_j \in \mathbb{R}^q, \quad \beta_j = ( \underbrace{\beta_{j10},\dotsc, \beta_{jp0}}_{\text{population effects}}, \underbrace{\beta_{j11},\dotsc, \beta_{jp1}}_{\text{$u_1$ effects}}, \dotsc, \underbrace{\beta_{j1q},\dotsc, \beta_{jpq}}_{\text{$u_q$ effects}} )^\top \in \mathbb{R}^{(p-1)(q+1)} \end{align*} \]

Impose the sparse-group lasso penalty:

\[(\hat\gamma_j, \hat\beta_j) = \operatorname*{arg min}_{\gamma_j, \beta_j} \left\{\ell_2(\gamma_j, \beta_j) + \eta\lVert\gamma_j\rVert_1 + \lambda\lVert{\beta_j}\rVert_1 + \lambda_g \sum_{h=1}^q \lVert{\beta_{j,h}}\rVert_2\right\}\]
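A minimal proximal-gradient sketch of this penalized problem (not the authors' implementation; all tuning values hypothetical). The composed prox soft-thresholds the \(\ell_1\) terms and then block-shrinks each \(u_h\)-interaction group, so population effects \(\beta_{j,0}\) receive only the \(\ell_1\) penalty:

```python
import numpy as np

def soft(v, t):
    """Elementwise soft-thresholding (prox of the l1 norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def group_shrink(v, t):
    """Block soft-thresholding (prox of the l2 norm of a group)."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= t else (1 - t / nrm) * v

def sparse_group_lasso(Z, y, q, pm1, eta, lam, lam_g, n_iter=500):
    """Proximal-gradient sketch for one nodewise regression.
    Coefficients ordered [gamma (q), beta_0 (pm1), beta_1 (pm1), ..., beta_q (pm1)]."""
    n, d = Z.shape
    w = np.zeros(d)
    step = n / np.linalg.norm(Z, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = Z.T @ (Z @ w - y) / n
        v = w - step * grad
        v[:q] = soft(v[:q], step * eta)    # l1 on gamma
        v[q:] = soft(v[q:], step * lam)    # l1 on all beta
        for h in range(1, q + 1):          # group penalty on u_h-interaction blocks
            s = q + h * pm1
            v[s:s + pm1] = group_shrink(v[s:s + pm1], step * lam_g)
        w = v
    return w

# Hypothetical small example: only gamma_1 and one population effect are active.
rng = np.random.default_rng(4)
n, q, pm1 = 200, 2, 4
Z = rng.standard_normal((n, q + pm1 * (q + 1)))
w_true = np.zeros(Z.shape[1]); w_true[0] = 1.0; w_true[q] = 2.0
y = Z @ w_true + 0.1 * rng.standard_normal(n)
w_hat = sparse_group_lasso(Z, y, q, pm1, eta=0.01, lam=0.01, lam_g=0.05)
```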

eQTL Analysis of Glioblastoma Multiforme

We ran our method on a data set of \(n=401\) GBM patients with \(p=73\) genes belonging to GBM pathways and \(q=118\) SNPs local to these genes. Highly connected genes include the calcium binding protein CALML5 and oncogenic transcription factor E2F3, which have roles in explaining GBM biology.

eQTL Analysis of Glioblastoma Multiforme

Estimated network effects of two SNPs

  • Left: Variant in BRAF mediates co-expressions of PDGFRA, PDGFB, and SOS2

  • Right: Variant in CALML4 mediates co-expressions of E2F1 and PIK3CA

Theoretical Results

Theorem 1 (\(\ell_2\) error): Under some regularity conditions and the assumption \((\color{red}{s_\beta} + \color{green}{s_\gamma})^2 \log(pq) = o(n)\), we have with high probability that \[\lVert{\hat{\gamma}_j-\gamma_j}\rVert_2^2 + \lVert{\hat{\beta}_j-\beta_j}\rVert_2^2 \precsim \frac{1}{n}\left(\color{green}{s_{\gamma}}\log(eq/\color{green}{s_\gamma}) + \color{red}{s_\beta}\log(ep) + \color{blue}{s_{\beta,g}}\log(eq/\color{blue}{s_{\beta,g}})\right)\]

\[ \begin{align*} \color{green}{s_\gamma} &= \text{number of non-zero elements in $\gamma_j$} \\ \color{red}{s_\beta} &= \text{number of non-zero elements in $\beta_j$} \\ \color{blue}{s_{\beta,g}} &= \text{number of non-zero groups in $\beta_j$} \end{align*} \]

Improved scaling assumption compared to existing methods

\[ (s_\beta + s_\gamma)^2 \log(pq) = o(n) \quad \text{vs} \quad \log(pq) = O(n^{1/6}),\, s_\gamma = o(n^{1/3}), \, s_\beta = o(n^{1/6}) \]

Support Recovery

Our method provides support recovery under an additional signal strength condition.

Theorem 2 (support recovery): If the minimum nonzero signal satisfies \(\min_{\ell \in S_\gamma,\, k \in S_\beta}\{|\gamma_\ell|, |\beta_k|\} \geq K(\color{blue}{\eta + \lambda + \lambda_g})\), then \[ \mathbb{P}(\color{green}{\hat S_\gamma = S_\gamma}, \color{red}{\hat S_\beta = S_\beta}) \geq 1 - C_1 \exp(-C_2 \log p) \]

\[ \begin{align*} \color{blue}{\eta, \lambda, \lambda_g} &: \text{sparse-group lasso penalty factors} \\ \color{green}{\hat S_\gamma, S_\gamma} &: \text{estimated and true support of $\gamma_j$} \\ \color{red}{\hat S_\beta, S_\beta} &: \text{estimated and true support of $\beta_j$} \end{align*} \]

Simulation Results

We ran our method, ncagr, against the benchmark RegGMM on a variety of simulated data sets. We report the true positive rate (TPR) and false positive rate (FPR) of detected edges, as well as the overall error of the nodewise regressions.

method   p    q     TPR    FPR    Error
ncagr    25   50    0.947  0.002   4.249
RegGMM   25   50    0.822  0.003   4.767
ncagr    50   50    0.554  0.001  12.274
RegGMM   50   50    0.520  0.005  12.656
ncagr    25   100   0.978  0.001   3.775
RegGMM   25   100   0.756  0.001   5.098

Thank you!

  • Zhang, J. and Li, Y. (2022). High-dimensional Gaussian graphical regression models with covariates. Journal of the American Statistical Association 0, 1–13

  • Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics 34, 1436–1462

  • Fehrmann, R. S. N., et al. (2011). Trans-eQTLs reveal that independent genetic variants associated with a complex phenotype converge on intermediate genes, with a major role for the HLA. PLoS Genetics 7, e1002197